READ-BAD: A New Dataset and Evaluation Scheme for Baseline Detection in Archival Documents

Authors

  • Tobias Grüning
  • Roger Labahn
  • Markus Diem
  • Florian Kleber
  • Stefan Fiel
Abstract

Text line detection is crucial for any application associated with Automatic Text Recognition or Keyword Spotting. Modern algorithms perform well on well-established datasets, since these comprise either clean data or simple, homogeneous page layouts. We have collected and annotated 2036 archival document images from different locations and time periods. The dataset contains varying page layouts and degradations that challenge text line segmentation methods. Well-established text line segmentation evaluation schemes, such as the Detection Rate or Recognition Accuracy, require binarized data that is annotated at the pixel level. Producing ground truth by these means is laborious and not needed to determine a method's quality. In this paper we propose a new evaluation scheme that is based on baselines. The proposed scheme needs no binarization, can handle skewed and rotated text lines, and its results correlate with Handwritten Text Recognition accuracy. The ICDAR 2017 Competition on Baseline Detection and the ICDAR 2017 Competition on Layout Analysis for Challenging Medieval Manuscripts make use of this evaluation scheme.

Journal:
  • CoRR

Volume: abs/1705.03311

Publication date: 2017